-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite Storage Component in C++ #270
Conversation
✅ Result of Pytest Coverage---------- coverage: platform linux, python 3.11.7-final-0 -----------
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already thank you so much for the work. I did a first review excluding the actual gRPC service and the binary and csv file wrapper since my brain did not focus anymore. I will try to review these tomorrow.
I hope my comments make sense. Let's address these comments, get the integration tests up and running and then I will make another pass when we know everything is working with postgres end to end. I think my biggest note is that I don't understand the multithreading within the FileWatcher and think it can be simplified a lot. For everything else, see the comments :)
modyn/storage/src/internal/filesystem_wrapper/local_filesystem_wrapper.cpp
Outdated
Show resolved
Hide resolved
modyn/storage/src/internal/filesystem_wrapper/local_filesystem_wrapper.cpp
Outdated
Show resolved
Hide resolved
modyn/storage/src/internal/filesystem_wrapper/local_filesystem_wrapper.cpp
Outdated
Show resolved
Hide resolved
modyn/storage/src/internal/file_wrapper/single_sample_file_wrapper.cpp
Outdated
Show resolved
Hide resolved
modyn/storage/src/internal/file_wrapper/single_sample_file_wrapper.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I went through the file wrappers and gRPC servicer.
- The gRPC servicer has the same MT pattern as the file watcher which should be changed.
- The gRPC servicer currently does not use sql partitions if I see correctly (see the comment on it for details) and materializes everything in memory instead like in Python. We should align this.
- For the CSV File wrapper, I wasn't sure whether we should use a library or not.
See the individual comments again for the details. Thank you so much for the work. Sorry for the amount of comments, it's just a huge PR, but I think step-by-step the things are addressable.
As discussed, for now, the idea would be that you adjust according to the review. Then, we get the integration tests up and running (in some places during the review I was not sure whether it will actually work end 2 end on postgres), and then I do another round of review, including the tests that time :) If the integration tests require some setup work I will be happy to assist, but I am not sure where the current issues lie
modyn/storage/src/internal/file_wrapper/binary_file_wrapper.cpp
Outdated
Show resolved
Hide resolved
modyn/storage/src/internal/file_wrapper/binary_file_wrapper.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, another round done. I resolved most of the old comments, with the exception of few of them which were still relevant.
I think a bigger question is how the gRPC server executes things with threads/processes. Furthermore, we should try to use something similar to sqlalchemy's yield_per interface. The servicer probably needs some more work/cleanup, the other parts are looking solid. There, my comments were mostly minor.
I didn't get to review the tests tonight, unfortunately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(viktor in case you see this ignore it it's just for me)
In the interest of #197 and #147 we have reworked the storage component from Python to cpp. In the process of this we have dramatically overworked the workflows involved while maintaining the logic and functionality of the component.
For additional testing we extended the pipeline workflow with additional c++ testing and linting. Note that timing changes from milliseconds to seconds in the database.
Open TODOs for Maxi: